Last week we introduced some of the key motivations behind Environmental Statistics.
The course will cover a number of statistical ideas around the general theme of environmental data.
This week we will be looking at uncertainty and variability, and how we can measure these and incorporate them into our conclusions.
We will then look at a number of important features of environmental data — censoring, outliers and missing data.
We often talk about uncertainty and error as though they are interchangeable, but this is not quite correct.
Error is the difference between the measured value and the “true value” of the thing being measured.
Uncertainty is a quantification of the variability of the measurement result.
Practically speaking, we make use of common statistical distributions to account for uncertainty.
A continuous random variable \(X\) follows a normal distribution with mean \(\mu\) and standard deviation \(\sigma\) if its probability density function (pdf) is:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]
We denote this as:
\[ X \sim \mathcal{N}(\mu, \sigma^2), ~\text{where} ~ -\infty < X < +\infty \]
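As a quick sanity check, the pdf above can be coded directly. This is an illustrative Python sketch (not part of the course code) with arbitrary \(\mu\) and \(\sigma\), verifying that the density integrates to approximately 1:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density f(x) as defined above."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Crude numerical check: the density should integrate to ~1
# over mu +/- 8 sigma (illustrative mu and sigma).
mu, sigma, n = 3.0, 2.0, 20000
lo, hi = mu - 8 * sigma, mu + 8 * sigma
h = (hi - lo) / n
area = h * sum(normal_pdf(lo + i * h, mu, sigma) for i in range(n + 1))
print(round(area, 4))  # ~1.0
```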
Why can’t we just use normal distributions for all environmental data?
A random variable \(X\) follows a log-normal distribution if \(\ln(X)\) follows a normal distribution, i.e.
\[ Y = \ln(X) \sim \mathcal{N}(\mu, \sigma^2), \quad \text{where}~ X \in (0, +\infty) \]
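This relationship can be illustrated numerically: taking logs of simulated log-normal draws should recover the underlying normal parameters. A minimal Python sketch with arbitrary illustrative parameters:

```python
import math
import random

random.seed(42)
mu, sigma = 1.0, 0.5  # illustrative parameters

# random.lognormvariate(mu, sigma) returns exp(N(mu, sigma^2)),
# so the logs of the draws should look N(mu, sigma^2).
logs = [math.log(random.lognormvariate(mu, sigma)) for _ in range(50_000)]

mean_log = sum(logs) / len(logs)
sd_log = math.sqrt(sum((v - mean_log) ** 2 for v in logs) / (len(logs) - 1))
print(round(mean_log, 2), round(sd_log, 2))  # close to mu and sigma
```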
A random variable \(X\) follows an exponential distribution with rate parameter \(\lambda >0\) if its probability density function (pdf) is:
\[ f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \geq 0 \\ 0 & \text{for } x < 0 \end{cases} \]
\(\lambda\) describes the rate of events, i.e., the number of events per unit time or distance.
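For instance (an illustrative Python sketch with an arbitrary rate), simulated exponential waiting times with rate \(\lambda\) should have mean close to \(1/\lambda\):

```python
import random

random.seed(1)
lam = 2.0  # illustrative rate: events per unit time

# Simulated waiting times between events; their mean should be ~1/lam.
waits = [random.expovariate(lam) for _ in range(100_000)]
mean_wait = sum(waits) / len(waits)
print(round(mean_wait, 3))  # close to 1/lam = 0.5
```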
A discrete random variable \(X\) follows a Poisson distribution with rate parameter \(\lambda > 0\) if its probability mass function (PMF) is:
\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, ~ k = 0, 1, \dots \]
We denote this as \(X \sim Po(\lambda)\), where \(\lambda\) describes the average number of events per unit time or space.
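The PMF can be coded directly. This illustrative sketch (arbitrary \(\lambda\)) checks that the probabilities sum to 1 and that the mean equals \(\lambda\):

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0  # illustrative rate
total = sum(poisson_pmf(k, lam) for k in range(100))
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
print(round(total, 6), round(mean, 6))  # ~1 and ~lam
```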
A discrete random variable \(X\) follows a binomial distribution with parameters \(n\) and \(p\) if:
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \dots, n \]
We denote this as \(X \sim Bi(n, p)\) where:
\(n\) = number of independent trials
\(p\) = probability of success in each trial
\(k\) = number of successes observed
Survival studies: \(n\) animals, each with survival probability \(p\)
Detection/non-detection: \(n\) surveys, probability \(p\) of detecting species
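As an illustration of the detection/non-detection example, with hypothetical values \(n = 10\) surveys and detection probability \(p = 0.3\):

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bi(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical detection example: n = 10 surveys, detection probability p = 0.3.
n, p = 10, 0.3
p_none = binom_pmf(0, n, p)      # species never detected in any survey
print(round(p_none, 4), round(1 - p_none, 4))  # miss all 10; detect at least once
```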
A discrete random variable \(X\) follows a negative binomial distribution with parameters \(r\) and \(p\) if:
\[ P(X = k) = \binom{k + r - 1}{k} (1-p)^r p^k, ~ k = 0, 1, \dots \]
The distribution of the number of failures observed before the \(r\)th success is denoted by \(X \sim \mathrm{NegBi}(r, p)\), where, under the parameterisation above, each trial succeeds with probability \(1 - p\).
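A quick numerical check of this parameterisation (with illustrative values \(r = 3\), \(p = 0.4\)): the probabilities should sum to 1 and the mean should be \(rp/(1-p)\):

```python
import math

def negbin_pmf(k, r, p):
    """P(X = k) with the parameterisation above: k failures before
    the r-th success, where each trial succeeds with probability 1 - p."""
    return math.comb(k + r - 1, k) * (1 - p) ** r * p ** k

r, p = 3, 0.4  # illustrative values
total = sum(negbin_pmf(k, r, p) for k in range(500))
mean = sum(k * negbin_pmf(k, r, p) for k in range(500))
print(round(total, 6), round(mean, 6))  # mean should be r*p/(1-p) = 2.0
```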
All bathing water sites in Scotland are classified by SEPA as “Excellent”, “Good”, “Sufficient” or “Poor” in terms of how much faecal bacteria (from sewage) they contain.
The minimum standard all beaches or bathing water must meet is “Sufficient”.
The sites are classified based on the 90th and 95th percentiles of samples taken over the four most recent bathing seasons.
Green denotes “Excellent”, blue “Good”, and red “Sufficient”.
The classification system assumes that bacterial concentrations at each site follow a log-normal distribution.
If this assumption does not hold, the classifications would not be accurate.
Therefore, it is crucial that we regularly assess this assumption to ensure the safety of our bathing water.
We can use our standard residual plots to assess log-normality.
The top plots show the standard residuals and the bottom plots show the residuals for the log-transformed data.
There is no strong evidence to suggest we have breached our assumptions.
Error in a measurement is the difference between the measured value and the true value.
Random error: Variation observed randomly over repeat measurements.
→ With more measurements, these errors average out (improving the precision of the estimate).
Systematic error: Variation that remains constant over repeated measurements.
For each example, identify whether the error is random or systematic:
A meter reads 0.01 even when measuring no sample.
An old thermometer can only measure to the nearest 0.5 degrees.
A poorly designed rainfall monitor often leaks water on windy days.
To estimate the abundance of a fish species in a lake, scientists use a net with a mesh size equal to the average fish length.
Key takeaway: Random errors can be reduced by averaging; systematic errors require calibration, better instruments, or method changes.
We report a measurement result as \[\text{estimated value } \pm \text{ standard uncertainty}.\]
We can use these to compute the standard uncertainty of the mean log(FS) as \[u = \frac{1.427}{\sqrt{80}} = 0.160.\]
This would therefore give a 95% confidence interval for the mean of log(FS) of \[3.861 \pm 1.96 \times 0.160 = (3.547, 4.175).\]
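The calculation above can be reproduced directly (Python used here for illustration; note that \(u\) is rounded to 0.160 before forming the interval, as on the slide):

```python
import math

s, n = 1.427, 80   # sample sd and sample size of log(FS)
xbar = 3.861       # sample mean of log(FS)

u = round(s / math.sqrt(n), 3)   # standard uncertainty (rounded as on the slide)
lower = xbar - 1.96 * u
upper = xbar + 1.96 * u
print(u, round(lower, 3), round(upper, 3))  # 0.16 3.547 4.175
```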
Sometimes, we have a result \(Y\) that is obtained from the values of \(n\) other quantities \(X_1, \dots, X_n\).
The combined uncertainty \(u(Y)\) of a linear combination \(Y = a + b_1X_1 + \dots + b_nX_n\) (where \(a, b_1, \dots, b_n\) are constants) is calculated as follows:
Combined uncertainty
\[u(Y) = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n}\left(u(X_i)\times u(X_j) \times b_i \times b_j \times \rho_{ij}\right)}\]
where \(u(X_i)\) and \(u(X_j)\) are the standard uncertainties of \(X_i\) and \(X_j\), respectively, and \(\rho_{ij}\) is the correlation between \(X_i\) and \(X_j\).
Combined uncertainty (independence)
\[u(Y) = \sqrt{\sum_{i=1}^{n}\left(u(X_i)^2 \times b_i^2\right)}\]
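Under independence, the formula reduces to a one-line function. A minimal sketch with hypothetical coefficients and uncertainties:

```python
import math

def combined_uncertainty(bs, us):
    """u(Y) for Y = a + sum(b_i * X_i) with independent X_i."""
    return math.sqrt(sum((b * u) ** 2 for b, u in zip(bs, us)))

# Hypothetical example: Y = 2*X1 + 3*X2 with u(X1) = 0.1, u(X2) = 0.2.
uY = combined_uncertainty([2, 3], [0.1, 0.2])
print(round(uY, 4))  # sqrt(0.2^2 + 0.6^2) = sqrt(0.4) ~ 0.6325
```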
General uncertainty propagation formula
The standard uncertainty of \(Y = f (X_1, ..., X_n)\) is:
\[u(Y) = \sqrt{\sum_{i=1}^n \left(\frac{\partial f}{\partial X_i}\bigg|_{\mu_i}\right)^2 u(X_i)^2}\]
where \(\frac{\partial f}{\partial X_i}\big|_{\mu_i}\) is the partial derivative of \(Y\) with respect to \(X_i\), evaluated at its mean \(\mu_i\).
The area \(A\) of a rectangle with height \(h\) and width \(w\) is \(A = h \times w\).
Height and width are measured with uncertainty, \(u(h)\) and \(u(w)\), respectively.
Evaluate the uncertainty on the area \(A\).
\[u(Y) = \sqrt{\sum_{i=1}^n \left(\frac{\partial f}{\partial X_i}\bigg|_{\mu_i}\right)^2 u(X_i)^2}\]
\(u(A) = u(h \times w)\)
\(\frac{\partial f}{\partial h} = w \quad \text{and} \quad \frac{\partial f}{\partial w} = h\)
\(\therefore u(A) = \sqrt{w^2\, u(h)^2 + h^2\, u(w)^2}\)
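Plugging in hypothetical measurements (say \(h = 2.0\), \(w = 5.0\), with \(u(h) = 0.02\) and \(u(w) = 0.05\)) gives:

```python
import math

h, w = 2.0, 5.0        # hypothetical measurements (m)
u_h, u_w = 0.02, 0.05  # their standard uncertainties (m)

A = h * w
u_A = math.sqrt(w ** 2 * u_h ** 2 + h ** 2 * u_w ** 2)
print(A, round(u_A, 3))  # area 10.0 with uncertainty ~0.141
```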
We often talk about the quality of a measurement process (or an associated estimate) in terms of accuracy, bias and precision.
Bias:
Measurement bias: the difference between the average of a series of measurements and the true value, mainly due to faulty measuring devices or procedures (systematic error).
Sampling bias: Under-representative sample of the target population (systematic error).
Estimation bias: Relates to the property of an estimator: the bias is \(E(\hat{\theta})-\theta\), which equals zero for unbiased estimators; the random error of the estimate decreases with increased sampling effort (see supplementary material for more details).
Precision is the closeness of agreement between independent measurements. Precision does NOT relate to the true value.
Accuracy is the overall distance between the estimated (or observed) values and the true value. There are several definitions of what this distance means, some of which include the precision (see Walther and Moore (2005)).
Over the last decade, the information available for surveying and monitoring ecological and environmental resources has changed radically.
The rise of new technologies facilitates the access to large volumes of environmental and ecological data.
Today’s ecological and environmental data landscape is overwhelmingly vast - far too extensive to cover comprehensively in one session!
Instead, we’ll focus on key data sources.
primary source of information for long-term environmental assessment, producing structured datasets
field surveys conducted on established monitoring networks to track trends in species populations, habitat quality, and ecosystem processes
Planned surveys produce structured data through constant monitoring schemes using standardised methods at sites on a regular basis.
This minimises observational error and sampling biases.
These are expensive to collect and tend to be geographically and temporally restricted.
| Monitoring Scheme | Description |
|---|---|
| United Kingdom Butterfly Monitoring Scheme (UKBMS) | Protocolized sampling scheme run by butterfly conservation that has monitored changes in the abundance of butterflies throughout the United Kingdom since 1976. |
| UK Environmental Change Network (ECN) | UK’s long-term ecosystem monitoring and research programme that has produced a large collection of publicly available data sets including meteorological, biogeochemistry and biological data for different taxonomic groups (Rennie et al. 2020). |
| National Hydrological Monitoring Programme (NHMP) | The NHMP, particularly the National River Flow Archive, provides national-scale management of hydrological data within the UK. Hosted by the UKCEH since 1982, it collates hydrometric data from gauging station networks operated by multiple agencies. |
| Natural Capital and Ecosystem Assessment (NCEA) | Long-term environmental monitoring of natural capital including data from freshwater Surveillance Networks, ecosystem condition & soil health, forest inventory, estuary and coast surveillance, etc. |
| Breeding Bird Survey (BBS) | Main scheme for monitoring the population changes of the UK’s common breeding birds. It covers all habitat types and monitors 110 common and widespread breeding birds using a randomised site selection. |
Unstructured data constitute the majority of available information.
| Advantages 😄👍 | Disadvantages 😔👎 |
|---|---|
| Extensive taxonomic, spatial and temporal coverage. | Under-reporting of rare and inconspicuous species. |
| Eye-catching species that are easily identifiable by participants. | Varying recording skills and uneven sampling effort. |
Large volumes of CS data come from opportunistic surveys, where sampling effort is biased across space and time.
Elevation versus sampling effort (obtained through the Pl@ntNet app) in the French Mediterranean region (figure taken from Botella et al. 2020).
Small populations at lower elevation could be over-sampled.
If we assume sampling is evenly distributed, species distributions at higher elevations would be under-estimated.
The oldest form of historical data reservoir, driven originally by personal interest but proven to be a key source of information for addressing modern global challenges.
The Natural History Museum in London safeguards a collection of over 80 million specimens, spanning 4.5 billion years of Earth’s history to the present.
Most historic collections were obtained in an opportunistic manner, largely dependent on the particular interests of the collector.
The information associated with each collection or specimen varies widely, limiting the environmental context.
Centralized, curated platforms that aggregate, preserve, and disseminate environmental data
Examples:
Global Biodiversity Information Facility (GBIF)
National Biodiversity Network (NBN) Atlas
UK-SCAPE plant diversity trends
UK Lakes portal
Key Features:
Standardize heterogeneous datasets
Enable cross-disciplinary data sharing
Often include interactive data portals with:
Visualization tools
Web applications
Programming interfaces (APIs)
Data catalogues
Processed information products transform raw measurements into refined, analysis-ready resources tailored for decision-makers and researchers.
Unlike primary data repositories, these products undergo rigorous calibration, integration, and modelling to generate authoritative maps, indicators, and synthesized datasets.
Example: WorldClim
WorldClim is a widely used set of global, high-resolution climate surfaces (raster maps) that provide interpolated estimates of historical and future projections of temperature, precipitation, and other bioclimatic variables.
These surfaces serve as the foundational data for species distribution modeling, ecological forecasting, and a vast range of other environmental research applications.
Remote sensing refers to the process of obtaining information about an object from a distance, typically from aircraft or satellites.
Enables non-invasive monitoring of Earth’s environment across vast scales, generating products like land cover maps and vegetation indices
Provides systematic, near-real-time data but has substantial uncertainties from sensor calibration, resolution constraints, and lower accuracy than field measurements
Requires validation with in-situ data to assess and ensure accuracy of remote sensing products
Digital Elevation Models (DEMs)
DEMs are digital representations of the earth’s topographic surface providing a continuous and quantitative model of terrain morphology.
The accuracy of DEMs is determined primarily by the resolution of the model (the size of the area represented by each individual grid cell in a raster).
Example: The Shuttle Radar Topography Mission (SRTM), acquired by NASA using a Synthetic Aperture Radar (SAR) instrument, provides near-global elevation data.
Land Cover Maps
Land cover maps describe the physical material on the Earth’s surface.
They are created by applying automated algorithms to satellite or aerial imagery to identify features such as grassland, woodland, rivers & lakes or man-made structures such as roads and buildings.
Example: UK CEH Land Cover Maps provide consistent national-scale representations of surface vegetation and land use classes.
NDVI Vegetation Index
Vegetation indices derived from remote sensing utilise spectral data from satellite or aerial sensors to quantify and monitor plant health, structure, and function across landscapes.
The Normalized Difference Vegetation Index (NDVI) ranges from -1 to +1, with positive values indicating healthier, denser vegetation and negative values indicating surfaces like water, snow, or bare ground.
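NDVI is computed from the near-infrared (NIR) and red reflectance bands as \((\text{NIR} - \text{Red})/(\text{NIR} + \text{Red})\). A minimal sketch with hypothetical reflectance values:

```python
def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)

# Hypothetical surface reflectances in the NIR and red bands:
print(round(ndvi(0.50, 0.10), 2))  # dense vegetation: high positive NDVI
print(round(ndvi(0.05, 0.10), 2))  # water / bare ground: negative NDVI
```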
Research-generated data repositories, such as Dryad and Zenodo, are cornerstone platforms in the modern scientific workflow, explicitly designed to uphold the principles of transparency, reproducibility, and open data access.
Core Features:
Researchers actively deposit datasets, code, and scripts
Assign persistent DOIs for citation and access
Enable verification and replication of findings
Impact:
Detects errors & reduces redundancy
Accelerates scientific discovery
Transforms single studies into community resources
Safeguards scientific integrity
Environmental and Ecological systems are inherently complex due to the large number of interrelated biological, physical, and social components
Adding to this complexity, analyzing these systems becomes a challenging task due to the heterogeneity of available data and the different sources of uncertainty that impact the quality of the data
Data collection methods vary widely and spatial and temporal sampling schemes may be too sparse to fully capture overall system behavior. Consequently, we often have to deal with issues such as outliers, missing values, and highly uncertain information.
Many of these data quality issues can be addressed through a rigorous data pre-processing and through statistical models that explicitly account for the observational process.
Important
Data pre-processing is a crucial stage in any sort of ecological or environmental data analysis; it includes data cleaning, outlier detection, missing value treatment, handling censored data, transformation, and the creation of new derived variables.
The goal is to create a robust, consistent dataset ready for analysis while carefully documenting all changes to preserve the integrity of the original information.
Censored data are data where we are restricted in our knowledge about them in some way or other.
Often this will be because we only know that the data value lies below a certain minimum value (or above a certain maximum).
For example, if we had scales which only weighed up to 10kg, we would not know the exact weight of any object greater than 10kg.
For environmental data, it is more common to have data which are censored at some minimum value.
This is because many pieces of measuring equipment will have an analytical limit of detection.
A limit of detection is the lowest concentration that can be distinguished with reasonable confidence from a “blank”, i.e. a hypothetical sample with a value of zero.
The limit of detection is often denoted \(c_L\).
Example
Your environmental monitoring device measures a pollutant concentration of 0.05 ppm, but the instrument’s Limit of Detection (LoD, \(c_L\)) is 0.1 ppm. Is this 0.05 ppm a measurement of real pollution?
We can’t say with confidence. The LoD of 0.1 ppm represents the lowest concentration that can be reliably distinguished from a blank sample.
At 0.05 ppm (below the LoD), we cannot confidently tell if it’s real pollution at a low level or just measurement noise.
Censoring has a huge impact on how we interpret our data.
The two plots below show the same data, but the right panel is ‘censored’ with two different limits of detection (some with an LOD of 0.5, others with an LOD of 1.5).
Censored observations are not completely without information. We still know they are equal to or more extreme than the limit.
For a LOD, we might therefore report the datapoint as either “not detected” or “\(< c_L\)”.
Removing them from our study would not be sensible, since this would lead to us overestimating the mean and probably also underestimating the variance.
We need to find a way to incorporate these censored datapoints into our analysis.
We can’t simply substitute the LOD value itself. This would ignore the fact that the true values are often below this limit.
In the plot below, the LOD reduces after every 100 observations (e.g. because of better quality equipment), and this leads to an artificial trend.
The simplest approach for dealing with LODs is via simple substitution.
This involves taking the LOD value and multiplying it by a fixed constant, e.g. replacing all \(<c_L\) values with \(0.5c_L\).
This approach is fairly popular because it is simple and easy to implement.
However, this approach only works if there is a small proportion of censored data (maximum 10–15%). If there is a higher proportion, it tends to overestimate the mean.
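A minimal sketch of simple substitution (hypothetical data with a single LOD of 0.5; non-detects are replaced with \(0.5c_L\)):

```python
LOD = 0.5
# Hypothetical measurements; None marks a non-detect ("< LOD").
raw = [1.2, None, 0.8, None, 2.1, 0.7, None, 1.5]

# Simple substitution: replace each censored value with 0.5 * LOD.
filled = [x if x is not None else 0.5 * LOD for x in raw]
print(filled)
print(round(sum(filled) / len(filled), 3))  # mean after substitution
```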
It is generally preferable to use a more statistics-based approach which accounts for the data distribution.
The basic idea is that we estimate the statistical distribution of the data in a way that takes into account the censoring.
We can then use this estimated distribution to simulate values for our censored points.
Commonly used distribution-based approaches are Maximum Likelihood, Kaplan-Meier and Regression on Order Statistics.
The maximum likelihood (ML) approach is a parametric approach, i.e. it requires us to specify a statistical distribution that is a close fit to the data.
We then identify the parameters of this distribution that maximise the likelihood of obtaining a dataset like ours.
This ML approach has to take into account the likelihood of obtaining: (i) the uncensored values, via the probability density function; and (ii) the censored values, via the probability of falling below the detection limit (the cumulative distribution function at \(c_L\)).
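The idea can be sketched for a normal distribution with a single LOD. The data below are hypothetical, and the coarse grid search stands in for a proper optimiser (a real analysis would use, e.g., the NADA package in R):

```python
import math
from statistics import NormalDist

LOD = 0.5
observed = [0.9, 1.4, 0.7, 2.0, 1.1, 0.8, 1.6, 1.2]  # detected values (hypothetical)
n_censored = 4                                        # values reported as "< 0.5"

def log_lik(mu, sigma):
    """Censored-normal log-likelihood: density terms for detected
    values, plus a cdf-at-LOD term for each non-detect."""
    d = NormalDist(mu, sigma)
    ll = sum(math.log(d.pdf(x)) for x in observed)
    ll += n_censored * math.log(d.cdf(LOD))
    return ll

# Coarse grid search for the ML estimates (illustration only).
grid = ((m / 50, s / 50) for m in range(-50, 101) for s in range(10, 101))
mu_hat, sigma_hat = max(grid, key=lambda ms: log_lik(*ms))
print(mu_hat, sigma_hat)
```

Note how the estimated mean sits below the naive mean of the detected values alone, since the non-detects pull it down.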
Advantages
Able to handle multiple limits of detection.
Good for estimating summary statistics with a suitably large sample size.
MLE explicitly accounts for the underlying distribution of the data (if known).
Disadvantages
More applicable to larger datasets (n > 50).
Reliant on specifying the correct distribution, otherwise estimates can be incorrect.
Transforming data to fit a distribution can potentially cause biased estimators.
The Kaplan-Meier approach is a nonparametric approach, i.e. it doesn’t require a distributional assumption.
It’s often used in survival analysis for estimating summary statistics for right-censored data.
However, it can be applied to left-censored data by ‘flipping’ the data and subtracting from a fixed constant.
In survival analysis, Kaplan-Meier estimates the probability that an observation will survive past a certain time.
In our ‘flipped’ context, it gives the probability that an observation will fall below the limit of detection.
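The ‘flipping’ trick itself can be sketched as follows (hypothetical values):

```python
# Hypothetical concentrations; True marks a value censored at its LOD.
values = [0.3, 1.2, 0.8, 2.4, 0.5]
censored = [True, False, False, False, True]

# Choose any constant larger than the maximum value and subtract:
# this reverses the ordering, so a left-censored value ("true value
# below v") becomes right-censored ("true flipped value above FLIP - v").
FLIP = 5.0
flipped = [FLIP - v for v in values]
print(flipped)
```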
Cadmium is a heavy metal identified as having potential health risks.
We observed cadmium levels in fish livers in two different regions of the Rocky Mountains.
Due to variation in data collection, there are four different LODs (0.2, 0.3, 0.4 and 0.6 µg per litre).
| Cd | Region | CdCen |
|---|---|---|
| 81.3 | SRKYMT | FALSE |
| 3.5 | SRKYMT | FALSE |
| 4.6 | SRKYMT | FALSE |
| 0.6 | SRKYMT | FALSE |
| 2.9 | SRKYMT | FALSE |
| 3.0 | SRKYMT | FALSE |
| 4.9 | SRKYMT | FALSE |
| 0.6 | SRKYMT | FALSE |
| 3.4 | SRKYMT | FALSE |
| 0.4 | COLOPLT | FALSE |
| 0.8 | COLOPLT | FALSE |
| 0.3 | COLOPLT | TRUE |
| 0.4 | COLOPLT | FALSE |
| 0.4 | COLOPLT | FALSE |
| 0.4 | COLOPLT | TRUE |
| 1.4 | COLOPLT | FALSE |
| 0.6 | COLOPLT | TRUE |
| 0.7 | COLOPLT | FALSE |
| 0.2 | SRKYMT | TRUE |
Plotting the data shows the potential impact of censoring.
The left panel shows all the data (plotting censored values as equal to the LOD), while the right panel excludes those which have been censored.
We can use the `NADA` (Nondetects and Data Analysis) package in R.
The `cenfit` function applies the Kaplan-Meier method. This package automatically ‘flips’ the data, since it is designed for environmental data.
```
                         n n.cen    median       mean        sd
Cadmium$Region=COLOPLT   9     3       0.4  0.5888889  0.3519259
Cadmium$Region=SRKYMT   10     1       3.0 10.5400000 25.0689539
```
The `cendiff` function tests for significant differences between the groups.
This uses a chi-squared hypothesis test:
\(H_0\): Median cadmium levels are the same in Region 1 and Region 2
\(H_1\): Median cadmium levels are different in Region 1 and Region 2
```
                         N Observed Expected (O-E)^2/E (O-E)^2/V
Cadmium$Region=COLOPLT   9     2.84     6.13      1.76      7.02
Cadmium$Region=SRKYMT   10     6.84     3.55      3.05      7.02

 Chisq= 7  on 1 degrees of freedom, p= 0.008
```
We can also plot the empirical cumulative distribution function (ECDF), taking into account the LODs.
Note that this works in the opposite direction from regular survival plots due to the ‘flipping’ of the data.
Advantages
Nonparametric — no need to assume underlying distribution.
Can easily account for multiple LODs.
Works for large numbers of censored datapoints (>50%).
Disadvantages
Quite simplistic — identical to simple substitution if we only have one LOD.
Less reliable for values near and below the LOD.
The mean tends to be overestimated — need to rely on median.
Regression on Order Statistics is a semi-parametric approach, i.e. it combines elements of parametric and nonparametric models.
It follows a two-step approach:
Plot the uncensored values on a probability plot (QQ plot) and use linear regression to approximate the parameters of the underlying data distribution.
Use this fitted distribution to impute estimates for the censored values.
There is an assumption that the censored measures are normally (or lognormally) distributed.
The plot shows the uncensored points and their probability plot regression model.
The NADA package in R uses lognormal as default. The plot suggests that this is sensible.
We then use this fitted model to estimate the values of the censored observations, based on their normal quantiles.
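The two ROS steps can be sketched for a single LOD with hypothetical log-normal data (the `NADA` package in R implements this properly, including multiple LODs):

```python
import math
from statistics import NormalDist

LOD = 0.5
detected = [0.6, 0.7, 0.9, 1.3, 1.8, 2.6]  # observed values >= LOD (hypothetical)
n_cens = 4                                  # values reported as "< 0.5"
n = len(detected) + n_cens

std = NormalDist()
# Plotting positions i/(n+1); the censored values occupy the lowest ranks.
quantiles = [std.inv_cdf((i + 1) / (n + 1)) for i in range(n)]

# Step 1: regress log(detected) on the normal quantiles of the
# uncensored ranks (ordinary least squares by hand).
y = [math.log(v) for v in sorted(detected)]
x = quantiles[n_cens:]
mx, my = sum(x) / len(x), sum(y) / len(y)
slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
intercept = my - slope * mx

# Step 2: impute the censored values from the fitted line.
imputed = [math.exp(intercept + slope * q) for q in quantiles[:n_cens]]
print([round(v, 3) for v in imputed])
```

The imputed values all fall below the LOD, as they should, while respecting the fitted lognormal shape.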
We can compare our ROS approach to simple substitution for the bathing water example used earlier.
The left panel (ROS) shows no trend present; the right panel (simple substitution) has an artificial trend.
Advantages
Can be applied to a wide variety of environmental datasets.
Works with multiple LODs, but still not too simplistic with a single LOD.
Can be used with up to 80% censored datapoints.
Disadvantages
Semiparametric approach — requires a distributional model to be assumed.
Specifically requires normality (or lognormality) for estimation of parameters.
Two-stage model introduces extra source of variability.
An outlier is an extreme or unusual observation in our dataset.
These will often (but not always) have a large influence on the outcomes of our analysis.
We have to find ways to identify and deal with outliers.
Can you think of any examples of outliers?
There are two main categories of outlier:
Note
See the notes material for a more detailed description of these tests.
Environmental data are very prone to missing values.
Data can be missing for any number of reasons.
There’s a whole discipline of statistics related to this. We will just touch on the topic.
Adverse weather (e.g., rainfall, snow, drought and wind) can affect measuring equipment or prevent access to the location.
Failure of scientific equipment.
Samples being lost or damaged.
Monitoring networks change in size over time. (Data are “missing” before the site is introduced or after it is removed.)
The technique we use to deal with missing data depends on the type of missingness.
If there are a handful of datapoints missing at random, we can essentially ignore this and carry out our analysis as usual.
However, if they are missing in some sort of systematic way (e.g., a whole month missing due to bad weather), we may instead look at some form of imputation.
Imputation is a process that involves predicting the missing values via some form of statistical method.
There are two main forms of imputation:
Single imputation has the advantage of being simpler, and allows straightforward analysis once the missing values have been estimated.
Multiple imputation does a better job of accounting for the uncertainty of the imputation process, but makes the final analysis more complex.
Our approach for generating the imputed value will vary depending on the context.
In the simplest case, we may replace missing values with the overall mean (usually only if we have very limited information).
More commonly, we may use neighbouring values, or some form of seasonal mean.
These will usually work reasonably well as long as we do not have too much missing data.
A more complex approach is to fit a more general statistical model, perhaps taking account of other variables and/or using random components.
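For example, seasonal-mean imputation can be sketched as follows (hypothetical monthly data; each gap is filled with the mean of the other observations from the same month):

```python
from collections import defaultdict

# Hypothetical monthly series with missing values marked as None.
months = [1, 2, 3, 1, 2, 3, 1, 2, 3]
values = [2.0, 3.0, 5.0, None, 3.4, 4.6, 2.2, None, 5.2]

# Compute the mean of the observed values for each month.
by_month = defaultdict(list)
for m, v in zip(months, values):
    if v is not None:
        by_month[m].append(v)
season_mean = {m: sum(vs) / len(vs) for m, vs in by_month.items()}

# Replace each missing value with its seasonal mean.
imputed = [v if v is not None else season_mean[m]
           for m, v in zip(months, values)]
print(imputed)
```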
Error is the difference between the measured value and the “true value” of the quantity being measured.
Uncertainty is a quantification of the variability of the measurement result.
Error includes two components: random error and systematic error.
Uncertainty can be expressed as \(\text{estimated value} \pm \text{standard uncertainty}\), or as a confidence interval.
Bias is the difference between the average of a series of measurements and the true value.
Precision is the closeness of agreement between independent measurements.
Accuracy is the distance between the estimated (or observed) values and the true value.
| Source 📅 | Advantages 😀 | Disadvantages 😒 |
|---|---|---|
| Monitoring programmes | Minimises sources of bias through design | Costly and temporally and geographically restricted |
| Citizen Science | Cost effective, large spatio-temporal coverage | Biased towards certain species and places that are easy to access or of public interest. |
| Biological collections | Large historical specimen collections have been preserved. | The data associated with each collection varies widely. Information about the sampling is often missing, and there are important sources of spatial and taxonomic bias. |
| Data repositories | Store large collections of data sources which are often publicly available. | Data are often standardized (losing information) or summarised to a particular spatial resolution. They contain varying data sources, some of which can be biased. |
| Processed products | Undergo rigorous calibration, integration, and modelling to generate high quality data. | Not always licence-free or publicly available. |
| Research Generated Data | Where available, they provide high-quality data, scripts and code that can be cited, supporting transparency and reproducibility. | Availability is not guaranteed, and code can become outdated or no longer maintained by its developers. |
We are restricted in our knowledge about censored data.
The limit of detection (LoD) \(c_L\) is the lowest concentration that can be distinguished with reasonable confidence from a “blank” (a hypothetical sample with a value of zero).
We can address LoDs through simple substitution, or distribution-based approaches such as Maximum Likelihood, Kaplan-Meier and Regression on Order Statistics.
An outlier is an extreme or unusual observation in our dataset.
Can be identified via:
Missing data can be missing at random, or systematically.
Systematic missingness may require imputation, e.g. using the overall mean, a seasonal mean, neighbouring values, or a more general statistical model.